Add U8 copy operation for K16 MMA #374

Merged

Conversation

aacostadiaz (Collaborator)

This PR adds the U8 copy operation that works correctly with the K16 MMA for FP8 GEMM or mixed dtype GEMM.
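For context, the kind of kernel configuration this enables looks roughly like the sketch below. It is assembled from type names that appear later in this thread (the XE_2D_U8x32x32_LD_N / XE_2D_U8x32x32_LD_V copy atoms and the K16 MMA atom XE_8x16x16_F32F16F16F32_TT), not verbatim code from the PR:

  // Sketch only -- U8 2D block copies paired with a K16 MMA atom; names taken
  // from the discussion below, exact atoms depend on the tile shape and dtypes.
  using GmemTiledCopyA = XE_2D_U8x32x32_LD_N;   // 32x32 tile of 8-bit elements for A
  using GmemTiledCopyB = XE_2D_U8x32x32_LD_V;   // VNNI-style load for B

  using TileShape = Shape<_64, _256, _32>;      // (M, N, K) workgroup tile

  using TiledMma = typename TiledMMAHelper<
      MMA_Atom<XE_8x16x16_F32F16F16F32_TT>,     // K16 MMA atom (f32 += f16 * f16)
      Layout<TileShape>,
      Layout<Shape<_2, _8, _1>, Stride<_8, _1, _0>>>::TiledMMA;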

jiyang1011 and others added 22 commits April 7, 2025 19:12

sanchitintel (Collaborator) commented May 21, 2025

With FP8xFP8 GEMM, this config didn't work, although the corresponding config works for FP16xFP16 GEMM:

  using GmemTiledCopyA = XE_2D_U8x32x32_LD_N;
  using GmemTiledCopyB = XE_2D_U8x32x32_LD_V;

  using TileShape = Shape<_64, _256, _32>;

  using TiledMma =
      typename TiledMMAHelper<MMA_Atom<XE_8x16x16_F32F16F16F32_TT>, Layout<TileShape>,
      Layout<Shape<_2, _8, _1>, Stride<_8, _1, _0>>>::TiledMMA;

The compile-time error was

include/cute/atom/copy_traits_xe.hpp:78:19: error: static assertion failed due to requirement 'size(cute::Layout<cute::tuple<cute::C<16>, cute::C<8>>, cute::tuple<cute::C<0>, cute::C<1>>>{}) % size(cute::tuple<cute::C<8>, cute::C<64>>{}) == 0'
   78 |     static_assert(size(LayoutIn{}) % size(BlockShape{}) == 0);

It seems to be a bug since the shapes are correct.

Thanks!

aacostadiaz removed the incremental (Incremental changes) label on May 27, 2025
Comment on lines +217 to +228
struct XE_2D_U8x32x32_LD_N {
  using BlockShape = Shape<_32, _32>;

  template <class T>
  CUTE_HOST_DEVICE static void copy(const void *baseoffset, int width,
                                    int height, int pitch, intel::coord_t coord,
                                    T *dst) {
#if defined(CUTE_ARCH_COPY_XE_ENABLED)
    static_assert(sizeof(T) == 1, "Expected T to have size 1");
    // detail::XeSubgroup2DBlockLoad<1, 16, 32, 2>{}(baseoffset, width, height, pitch, coord, dst);
    // Use the transform (VNNI) version as it provides better performance when loading the A matrix for
    // GEMM FP8 and GEMM mixed-precision types.
sanchitintel (Collaborator) May 27, 2025

Hi @aacostadiaz,

Please help resolve a couple of doubts.

The DstLayout in the atom traits for this copy atom is Layout<Shape <_16,Shape <_8, _2, _32>>, Stride<_16,Stride< _1,_128,_256>>>, which seems to correspond to a plain (non-VNNI) layout. So, does this mean that when the data is copied from global memory, it is first transformed into the VNNI layout before being written to the registers, and later converted to DstLayout? If yes, could you please point out where/how this is handled in the code?

Also, I don't see any shfl-based instructions in the generated assembly dump. Is it possible that the shuffle (for the VNNI -> plain layout conversion) is not happening directly via lane registers -> lane registers (I understand this isn't possible on Nvidia GPUs, but the documentation suggests it is somehow possible on Intel GPUs), but via lane registers -> shared local memory -> lane registers?

Thanks!

cc @pengzhao-intel @yuankuns

aacostadiaz (Collaborator, Author)

The Copy trait is used to describe how a copy operation works so that the rest of the code can understand it. It does not change how the actual copy operation works.

In this case, for the VNNI copies, the transformation happens inside the builtin/SPIR-V function. There is no transformation inside CUTLASS for that; we just use these builtin/SPIR-V functions, and the copy traits describe how they work.
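For reference, such a trait is purely descriptive metadata. The sketch below follows upstream CuTe's Copy_Traits convention (ThrID / SrcLayout / DstLayout / RefLayout) and reuses the bit layouts quoted elsewhere in this thread; the actual specialization in include/cute/atom/copy_traits_xe.hpp may differ in its details:

  // Sketch only -- not the code added by this PR. The layouts map (thread, value)
  // coordinates to bit offsets within the copied block; they describe what the
  // XeSubgroup2DBlockLoad* builtin produces, they do not implement the copy itself.
  template <>
  struct Copy_Traits<XE_2D_U8x32x32_LD_N> {
    using ThrID = Layout<_16>;                   // one sub-group of 16 work-items
    // Map from (src-thr, src-val) to bit
    using SrcLayout = Layout<Shape <_16, Shape <_8, _2, _32>>,
                             Stride< _0, Stride< _1, _128, _256>>>;
    // Map from (dst-thr, dst-val) to bit
    using DstLayout = Layout<Shape <_16, Shape <_8, _2, _32>>,
                             Stride<_16, Stride< _1, _128, _256>>>;
    using RefLayout = DstLayout;                 // reference frame used for partitioning
  };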

sanchitintel (Collaborator) May 30, 2025

@aacostadiaz, thanks, but I meant that since A for FP8 GEMM is being loaded in VNNI layout in this PR and the GEMM output is correct, the layout must have been changed from VNNI back to plain somewhere in the code.

In this case, for the VNNI copies the transformation happens inside the builtin/spirv function

Sorry, do you mean the VNNI -> plain transformation also happens inside the builtin? Thanks!

aacostadiaz (Collaborator, Author)

Yes, XeSubgroup2DBlockLoad<1, 16, 32, 2> and XeSubgroup2DBlockLoadTransform<1, 16, 32, 2> (Transform is the VNNI transformation) load exactly the same data, and we end up with exactly the same values in the registers. The only difference with XeSubgroup2DBlockLoadTransform<1, 16, 32, 2> is that the packing is 32 bits, so we get 32-bit elements out of the copy operation. If you recast these into four 8-bit elements each, you have exactly the same information as with the XeSubgroup2DBlockLoad<1, 16, 32, 2> copy.
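To illustrate the equivalence described above, here is a small host-side sketch (not code from the PR) that models one work-item's column of a 32-row U8 block. It assumes the transform packs four consecutive rows of that column into one little-endian 32-bit value:

  #include <cassert>
  #include <cstdint>
  #include <cstring>
  #include <vector>

  int main() {
    constexpr int H = 32, W = 16;           // block height x width in bytes; one column per work-item
    std::vector<uint8_t> block(H * W);
    for (int i = 0; i < H * W; ++i) block[i] = static_cast<uint8_t>(i);

    const int lane = 5;                     // pick one work-item / column

    // "Plain" load: the work-item receives the 32 bytes of its column, top to bottom.
    uint8_t plain[H];
    for (int r = 0; r < H; ++r) plain[r] = block[r * W + lane];

    // "Transform" (VNNI) load: the same column, packed 4 bytes per 32-bit value.
    uint32_t packed[H / 4];
    for (int d = 0; d < H / 4; ++d) {
      packed[d] = 0;
      for (int b = 0; b < 4; ++b)
        packed[d] |= static_cast<uint32_t>(block[(4 * d + b) * W + lane]) << (8 * b);
    }

    // Recasting the packed 32-bit values back into bytes recovers the plain column
    // exactly (on a little-endian host), i.e. the same information either way.
    uint8_t recast[H];
    std::memcpy(recast, packed, sizeof(recast));
    assert(std::memcmp(recast, plain, H) == 0);
    return 0;
  }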

sanchitintel (Collaborator) Jun 5, 2025

Ah, since we load the data column-wise (from the POV of one work-item) with XeSubgroup2DBlockLoad anyway, it doesn't matter whether we use XeSubgroup2DBlockLoadTransform or XeSubgroup2DBlockLoad. (I haven't yet reasoned about whether it would work for all relevant tile shapes, though; I'll do that later.)

From https://github.khronos.org/SPIRV-Registry/extensions/INTEL/SPV_INTEL_2d_block_io.html,

[screenshot of the 2D block load description from the SPV_INTEL_2d_block_io specification]

cfgfung commented May 28, 2025

Hi @aacostadiaz,

The vLLM team is blocked by this issue. Could you please prioritize this and merge it into the main branch?

joeatodd (Collaborator) left a comment

I'm unsure about the Layout for the new operation, which looks like it might relate to @sanchitintel's comment.

Aside from that, just a nit suggestion.

Comment on lines 640 to 644
using SrcLayout = Layout<Shape <_16,Shape <_8, _2, _32>>,
                         Stride< _0,Stride< _1,_128,_256>>>;
// Map from (dst-thr,dst-val) to bit
using DstLayout = Layout<Shape <_16,Shape <_8, _2, _32>>,
                         Stride<_16,Stride< _1,_128,_256>>>;
Collaborator

It looks like XE_2D_Packed_U8x32x32_LD_N and XE_2D_U8x32x32_LD_N have the same *Layout traits. Is that expected?

Collaborator

I'll check out the copy_debug tool to verify why they look similar (they were the same when you commented) and will report back with any findings. Thanks!

aacostadiaz and others added 2 commits May 29, 2025 17:28
Co-authored-by: Joe Todd <[email protected]>
Co-authored-by: Tadej Ciglarič <[email protected]>
@@ -535,7 +535,7 @@ int main(int argc, const char** argv)
using ElementScale = MmaType;

// Note: XE_2D_U18x32x32_LD_N is incompatible with our bf16 MMA atoms
sanchitintel (Collaborator) May 30, 2025

nit: this comment seems to be obsolete now.

muhammad-tanvir-1211 (Collaborator) left a comment

LGTM, Thanks.

7 participants